3.3.5. Better Plots

3.3.5.1. The theory of good visuals

There is an enormous amount of scholarship and debate about what makes graphs effective, and I can’t possibly do the field justice. Below is one person’s distillation of some tips that are reasonably well agreed upon. I’m aiming to be concise so that we can practice, but if you want more, visit the links below and the links in the last lecture.

Don’ts

  • pie charts: humans stink at interpreting angles

  • stacked bar charts: tough to decode trends

  • make your reader do math: if \(x-y\) is interesting, don’t plot \(x\) and \(y\) separately, just plot \(x-y\)

  • misleading scales

  • 3D unless absolutely necessary (and it almost surely isn’t)

  • distracting chart junk

  • unnecessary colors

  • spaghetti charts: too many lines
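
To make the "don’t make your reader do math" tip concrete, here is a minimal sketch. The two series are made-up numbers chosen only for illustration, and the variable names (`months`, `gap`, `ax_bad`, `ax_good`) are my own. The left panel forces the reader to subtract two lines by eye; the right panel plots \(x-y\) directly.

```python
import matplotlib.pyplot as plt
import numpy as np

# Made-up illustrative data: two monthly series whose gap is the real story
months = np.arange(1, 13)
x = 100 + 2.0 * months   # hypothetical series x
y = 100 + 1.5 * months   # hypothetical series y
gap = x - y              # the quantity the reader actually cares about

fig, (ax_bad, ax_good) = plt.subplots(1, 2, figsize=(9, 3))

# Left: the reader has to subtract one line from the other by eye
ax_bad.plot(months, x, label="x")
ax_bad.plot(months, y, label="y")
ax_bad.set_title("Reader must compute x - y")
ax_bad.legend()

# Right: the difference is plotted directly, with a zero reference line
ax_good.plot(months, gap)
ax_good.axhline(0, linewidth=0.8, color="gray")
ax_good.set_title("x - y plotted directly")

for ax in (ax_bad, ax_good):
    ax.set_xlabel("month")

fig.tight_layout()
```

In a notebook the figure appears when the cell finishes; in a plain Python script, add `plt.show()` at the end.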

Do’s: slides 49-64

  • Show the data, reduce the clutter, and integrate the text and the graph

    • graphs should aspire to be sufficient to understand without reading the text

  • Control the aspect ratio

  • Think about whether you need to include zero. Sometimes excluding it makes the figure misleading. Sometimes including it (and expanding the y-axis to do so) hides the variation you’re describing.

  • Facilitate comparisons:

    • by placing figure components next to or above (depends!) the stuff they are compared to

    • by using the same axis (two y-axes is usually bad!)

    • labels > legends! (so readers’ eyes don’t have to dart back and forth)

    • sort in meaningful orders (i.e. not alphabetically!)
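
A few of the "do’s" above (sort meaningfully, label directly, reduce clutter) can be sketched in a handful of lines. This is just one way to do it, not the only one. The apple numbers are invented for illustration and the names `sales` and `ordered` are my own.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Made-up values purely for illustration
sales = pd.Series(
    {"Fuji": 42, "Gala": 57, "Granny Smith": 31, "Honeycrisp": 64},
    name="bushels",
)

# Sort by value rather than leaving the categories in alphabetical order
ordered = sales.sort_values()

fig, ax = plt.subplots(figsize=(6, 3))
ax.barh(ordered.index, ordered.values, color="steelblue")

# Label each bar directly so readers do not need a legend or gridlines
for position, value in enumerate(ordered.values):
    ax.text(value + 1, position, str(value), va="center")

# Reduce clutter: drop the box edges that carry no information
for side in ("top", "right"):
    ax.spines[side].set_visible(False)

ax.set_xlabel("bushels picked")
ax.set_title("Apples picked, by variety")
fig.tight_layout()
```

Horizontal bars keep long category names readable, and with the values printed on the bars you could arguably drop the x-axis ticks too.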

3.3.5.2. Transforming bad figures to good ones

3.3.5.2.1. Customizing figure aspects

  1. Create your plot in pandas or seaborn

  2. Format the figure as much as possible from within the pandas or seaborn function. I have some info on that below.

  3. If/when necessary, use matplotlib to customize the figure.

After you create a figure, subsequent plotting calls modify that same figure.

Copy the code below into a Python file and run it. Then uncomment the next line and rerun to see the change it made. Then uncomment the line after that, rerun, and so on.

import matplotlib.pyplot as plt
import numpy as np

x = np.linspace(0, 2, 100)

plt.plot(x, x, label='linear')       # creates plt obj
# plt.plot(x, x**2, label='quadratic') # adds another plot on top
# plt.plot(x, x**3, label='cubic')     # again

# plt.xlabel('x label')
# plt.ylabel('y label')
# plt.title("Simple Plot")

# plt.legend()

# plt.show()

For changes outside the pd and sns plot functions: Honestly, I can’t do better than this page.
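
As a rough illustration of the three-step workflow above, here is a sketch that starts in pandas and finishes in matplotlib. It reuses the `x`, `x**2` idea from the earlier snippet; the DataFrame `df` and its column names are my own invention. The key fact it relies on is that `DataFrame.plot` returns a matplotlib `Axes`, so anything matplotlib can do is available afterwards.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Made-up data, echoing the simple example above
x = np.linspace(0, 2, 100)
df = pd.DataFrame({"linear": x, "quadratic": x**2}, index=x)

# Steps 1 and 2: create the plot in pandas and format what you can in the call
ax = df.plot(title="Simple Plot", figsize=(6, 3), legend=False)

# Step 3: df.plot returned a matplotlib Axes, so further calls modify the same figure
ax.set_xlabel("x label")
ax.set_ylabel("y label")

# Direct labels at the end of each line instead of a legend
for column in df.columns:
    ax.annotate(
        column,
        xy=(df.index[-1], df[column].iloc[-1]),
        xytext=(5, 0),
        textcoords="offset points",
        va="center",
    )
```

Seaborn’s axes-level functions (for example `sns.scatterplot`) also return an `Axes`, so the same step-3 pattern should carry over.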

```{dropdown} Formatting plots in `pandas`
todo
```

```{dropdown} Formatting plots in `seaborn`
todo
```

3.3.5.3. Practice: Thinking and planning

Questions: For Q1-Q3, which type of graph (bar, line, or histogram) would you use?

  1. The volume of apples picked at an orchard based on the type of apple (Granny Smith, Fuji, etcetera).

  2. The number of points for each game in a basketball season for a team.

  3. The count of apartment buildings in Chicago by the number of individual units.

  4. Suppose we create a scatter plot but find that due to the large number of points it’s hard to interpret. What are two things we can do to fix this issue?

  5. Suppose that we create an n-by-n FacetGrid. How big can “n” get?

  6. What are the two things about faceting which make it appealing?

  7. When is sns.pairplot most useful?

**Answers**

3.3.5.4. Interactive plots: plotly

I want to show you how far we can push this to explore leverage and firm value. The code uses plotly’s subpackage plotly.express, which is ridiculously easy to use given how cool these plots are.

And as an exercise, you might critique these - I certainly think there are aspects to improve!

#!pip install plotly
%matplotlib inline
import pandas as pd
import numpy as np
import plotly.express as px # pip install plotly.. the animation below is from plotly module
from io import BytesIO
from zipfile import ZipFile
from urllib.request import urlopen

url = 'https://github.com/LeDataSciFi/ledatascifi-2021/blob/main/data/CCM_cleaned_for_class.zip?raw=true'

#firms = pd.read_stata(url)   <-- would work, but GH said "too big" and forced me to zip it, 
# so here is the work around to download it:

with urlopen(url) as request:
    data = BytesIO(request.read())

with ZipFile(data) as archive:
    with archive.open(archive.namelist()[0]) as stata:
        firms = pd.read_stata(stata)

# firms = pd.read_stata('https://github.com/LeDataSciFi/ledatascifi-2021/blob/main/data/CCM_cleaned_for_class.zip?raw=true')
firms.name = "Firms"
# https://jupyterbook.org/guide/05_faq.html#How-can-I-include-interactive-Plotly-figures?

# the lines before and after the fig help make sure this is viewable on the website 
# but shouldn't be necessary just for notebook viewing... but I'm not sure about github viewing

from IPython.display import display, HTML  # IPython.core.display.display is deprecated
from plotly.offline import init_notebook_mode, plot
init_notebook_mode(connected=True)

fig =   (
        firms
            .query('(fyear < 2014) & (mb < 5) & (td_a >= 0) & (td_a < 1.5) ')         # some sensible limits
            .groupby(['state','gsector','fyear'])
            .agg({'td_a':'mean','mb':'mean','at':'sum','lpermno':'count'
                 }) # we need the # of firms per industry-state for an extra filter
                    # and I wanted the total assets summed so bigger industries get bigger circles
            .rename(columns={'td_a':'Avg Book Leverage', 'mb':'Avg Market to Book','lpermno':'Num_Firms'})     
            .query('Num_Firms > 20 ')    # discard small industry-states
            .reset_index() # get fyear as a variable for plotting function
            .pipe( 
                 (px.scatter,'data_frame'), 
                  y='Avg Market to Book', x='Avg Book Leverage', animation_frame="fyear", 
                  range_x=[0,.5], range_y=[0,2], hover_data=["state","gsector"],
                  title = "State-By-Industry Avg Leverage and Avg Firm Value"
            )
        )
    
plot(fig, filename = 'ind-state mb vs lev.html')
display(HTML('ind-state mb vs lev.html'))
fig =   (
            firms
                .query('(fyear < 2014) & (mb < 5) & (td_a >= 0) & (td_a < 1.5) ')         # some sensible limits
                .query('state in ["CA","NY"] & gsector in ["40","45"]')  # sample restriction
                .rename(columns={'td_a':'Book Leverage'})    
                .reset_index() # get fyear as a variable for plotting function
                .pipe( 
                     (px.scatter,'data_frame'), 
                      y='mb',x='Book Leverage',animation_frame="fyear",
                      range_x=[0,1.5], range_y=[0,5], 
                      facet_row="gsector", facet_col="state",
                      hover_data=["state","gsector"],
                      title = "Leverage and Firm Value"
                )
        )
plot(fig, filename = 'mb vs lev for each state-ind.html')
display(HTML('mb vs lev for each state-ind.html'))
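
The `.pipe((px.scatter,'data_frame'), ...)` idiom in the two chains above may look odd, so here is a small sketch of what it does. When `pipe` receives a `(function, 'keyword')` tuple, it passes the piped DataFrame to the function under that keyword name. The function `describe_frame` below is a hypothetical stand-in for a plotting function, and the tiny DataFrame is made up.

```python
import pandas as pd


def describe_frame(title, data_frame):
    """Hypothetical stand-in for a plotting function that takes its data by keyword."""
    return f"{title}: {len(data_frame)} rows, columns={list(data_frame.columns)}"


# Made-up data for illustration
df = pd.DataFrame({"fyear": [2010, 2011, 2012], "mb": [1.1, 1.4, 0.9]})

# pipe((func, 'data_frame'), ...) calls func(..., data_frame=<the piped DataFrame>)
via_pipe = (
    df
    .query("mb > 1")
    .pipe((describe_frame, "data_frame"), title="High M/B")
)

# The same call written without pipe
direct = describe_frame(title="High M/B", data_frame=df.query("mb > 1"))

print(via_pipe)
```

Because `data_frame` is also the first positional parameter of `px.scatter`, a plain `.pipe(px.scatter, x=..., y=...)` should behave the same way there; the tuple form just makes the target argument explicit.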

One more: This is a replication of a famous Hans Rosling TED talk figure using the well-known gapminder data:

fig = px.scatter(px.data.gapminder(), x="gdpPercap", y="lifeExp",
                    size="pop", color="continent",animation_frame="year",
                     range_y=[30,85],              
                    hover_name="country", log_x=True, size_max=60)
plot(fig, filename = 'hans.html')
display(HTML('hans.html'))